Overview and motivation

A single text document often has multiple semantic aspects. A news article about politics, for example, may also touch on trade, technology and defense. A document therefore often needs to be tagged with multiple labels or categories rather than a single one. Most classification algorithms deal with datasets that have a set of input features and only one output class. In practice, however, the problem often differs from typical binary or multiclass classification, because a document or an image can be associated with several categories at once.

The enormous number of documents belonging to multiple categories in the legal domain makes it an attractive area for automated solutions. In this project we explore a public multi-label legal text dataset that has been manually annotated over a decade. It contains laws related to the European Union, including treaties, legislation, case-law and legislative proposals, in 24 different languages. It is popularly known as the EUR-Lex dataset and contains about twenty-five thousand documents and around seven thousand labels. The skewed distribution of labels per document, together with the availability of the same data in multiple languages, makes this dataset an interesting proposition. A few publications have used an older version of the dataset, which had around four thousand labels. Those that did reported relatively poor F1 values, in the range of 40% (where the dataset is referred to as EUROVOC) [1], which may be reasonable given the high number of labels. To our knowledge there are no publications for the new dataset (with around 7000 labels), which motivates us to explore the problem of multilabel classification on this dataset.

Multilabel vs. multiclass classification
In multi-label classification, each instance in the training set is associated with a set of labels rather than a single label, and the task is to predict the label sets of unseen instances. This differs from multi-class classification, where the classes are mutually exclusive: each instance is assumed to belong to exactly one class. For example, an animal can be either a dog or a cat, but not both. In a multi-label problem, several labels may be assigned to the same instance. For example, a movie can belong to both the comedy genre and the detective genre.
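
The difference shows up directly in how the targets are stored. The following is a minimal Python sketch, using invented toy data and a hypothetical helper to_indicator (neither comes from the project), of a multi-class target vector next to a multi-label indicator matrix:

```python
# Toy data, invented for illustration only.
LABELS = ["comedy", "detective", "romance"]

# Multi-class: exactly one class per instance.
multiclass_targets = ["dog", "cat", "dog"]

# Multi-label: a set of labels per instance.
multilabel_targets = [
    {"comedy", "detective"},
    {"romance"},
    {"comedy", "detective", "romance"},
]


def to_indicator(label_sets, labels):
    """Encode each label set as a 0/1 row with one column per label."""
    return [[1 if label in s else 0 for label in labels] for s in label_sets]


indicator = to_indicator(multilabel_targets, LABELS)
for row in indicator:
    print(row)
```

Each row of the indicator matrix may contain any number of ones, whereas a multi-class target always names exactly one class.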

Project objectives

Can we use machine learning techniques to automatically annotate legal documents?

To answer this, we address the following research questions:

  • How well do the classifiers perform on the EUR-Lex dataset for two languages (English and German)?
  • How does classifier performance change with different features: term frequency–inverse document frequency (tf-idf) versus term incidence?
  • Which flavour of multilabel transformation algorithm performs best: one that considers label correlation or one that does not?
  • How does classifier performance change when the number of labels is reduced?
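
To make the two feature types concrete, here is a small Python sketch on an invented three-document corpus. It assumes one common tf-idf variant (raw term count multiplied by log(N / document frequency)); the exact weighting used in the project is not stated, and the function names are hypothetical.

```python
import math
from collections import Counter

# Invented toy corpus; real input would be the preprocessed EUR-Lex text.
docs = [
    "council regulation on trade",
    "regulation on fisheries trade",
    "treaty on defence",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)
# Number of documents in which each term occurs at least once.
doc_freq = Counter(t for doc in tokenized for t in set(doc))


def term_incidence(doc):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if t in present else 0 for t in vocab]


def tf_idf(doc):
    """Raw term count weighted by log(N / document frequency)."""
    counts = Counter(doc)
    return [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]


print(vocab)
print(term_incidence(tokenized[0]))
print([round(w, 3) for w in tf_idf(tokenized[0])])
```

Term incidence only records presence, while tf-idf down-weights terms that occur in many documents: under this variant, a term such as "on" that appears in every document gets weight zero.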

Design overview (algorithms and methods)

  • Pre-processing:
    • Exclude stop words and perform lemmatization.
    • Extract features: term frequency–inverse document frequency (tf-idf) and term incidence.
    • Generate the MLD [2] data format, which is needed for multi-label data exploration and classification with the mldr [3] and utiml [4] packages.
  • Statistical exploration:
    • Basic exploration: distribution of attributes/labels
    • Multi-label-specific exploration: labelset distribution, relationships among labels, and relationships between attributes and labels/labelsets
  • Classification:
    • Apply the classifiers (Nearest Neighbour, Random Forest, XGBoost) over the preprocessed dataset (tf-idf and term incidence) for German and English text, and for three flavours of multilabel classification methods:
      • Binary Relevance (BR) [5]
      • Label Powerset (LP) [6]
      • Classifier Chain (CC) [7]
    • Apply the classifiers (Nearest Neighbour, Random Forest, XGBoost) over the preprocessed dataset (tf-idf and term incidence) for German and English text for balanced labelsets, and for two flavours of multilabel classification methods:
      • Binary Relevance (BR)
      • Label Powerset (LP)
  • The following evaluation measures are used primarily to assess multilabel predictive performance:
    • Accuracy
    • Hamming Loss
    • Micro F1
    • Macro F1
  • Compare the performance of the state-of-the-art classifiers for:
    • Two languages (English and German)
    • Two kinds of features (tf-idf and incidence)
    • Three flavours of multilabel classification algorithms (Binary Relevance, Label Powerset, Classifier Chain)
    • Balanced labelsets
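
As an illustration of three of the evaluation measures listed above, the following Python sketch computes Hamming loss, micro F1 and macro F1 on an invented 0/1 indicator matrix. It is not the implementation used by utiml; the helper names are hypothetical, and setting F1 to 0 for a label with no true or predicted positives is an assumption. Multilabel accuracy is left out because its exact definition depends on the package.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (instance, label) cells that are predicted wrongly."""
    cells = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
    )
    return wrong / cells


def _counts(y_true, y_pred, j):
    """True positives, false positives and false negatives for label j."""
    pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: 0 when undefined


def micro_f1(y_true, y_pred):
    """F1 over counts pooled across all labels."""
    per_label = [_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(c) for c in zip(*per_label))
    return _f1(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 values."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


# Invented toy data: 4 documents, 3 labels.
y_true = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [1, 0, 0]]

hl = hamming_loss(y_true, y_pred)
mi = micro_f1(y_true, y_pred)
ma = macro_f1(y_true, y_pred)
print(hl, round(mi, 4), round(ma, 4))
```

In this toy example the frequent first label is predicted perfectly and the rare second label is always missed, so micro F1 (8/11) is higher than macro F1 (5/9). The same effect is worth keeping in mind for a dataset with a skewed label distribution.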

Data

Name and source

European Union law documents (EUR-Lex). The data is bundled with software distributed by the European Union.

Data format

  • The EUR-Lex dataset for every language comprises two files.
  • The documents (laws/treaties) and the document categories/labels are available in a cf file (acquis.cf).
  • The content and the labels for each document are stored in the file as follows:

    • Every odd line consists of the label-ids and the document-id of a document. The label-ids and the document-id are separated by a #.
    • Every even line consists of the actual text.

An example is shown in the figure below. The first line holds two label-ids (3032 and 525) and the document-id 31958d1006(01); the actual text is on the second line.

[Fig1. Data format]
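
The layout described above could be read with a few lines of Python. This is a sketch under assumptions: the label-ids are taken to be separated by whitespace and/or commas (the exact separator is not stated here), the function name parse_cf is hypothetical, and the document text in the sample is a placeholder.

```python
import re


def parse_cf(lines):
    """Return (doc_id, label_ids, text) tuples from alternating lines.

    Odd lines:  "<label-ids> # <document-id>"
    Even lines: the document text
    """
    lines = [ln.rstrip("\n") for ln in lines]
    docs = []
    for header, text in zip(lines[0::2], lines[1::2]):
        labels_part, _, doc_id = header.partition("#")
        # Assumption: label-ids are separated by whitespace and/or commas.
        label_ids = [tok for tok in re.split(r"[,\s]+", labels_part.strip()) if tok]
        docs.append((doc_id.strip(), label_ids, text))
    return docs


# Sample built from the ids in the example; the text is a placeholder.
sample = [
    "3032 525 # 31958d1006(01)\n",
    "placeholder text of the document\n",
]
parsed = parse_cf(sample)
print(parsed)
```

Because parse_cf accepts any iterable of lines, an open file handle for acquis.cf could be passed in directly; the file encoding would need to be checked first.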

The mapping between label-id and label name is provided in an XML file. A small snippet of the XML is shown below.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DESCRIPTEUR SYSTEM "descripteur.dtd">
<DESCRIPTEUR LNG="EN" VERSION="4_3">
  <RECORD>
    <DESCRIPTEUR_ID>4444</DESCRIPTEUR_ID>
    <LIBELLE>abandoned land</LIBELLE>
  </RECORD>
</DESCRIPTEUR>

The tag DESCRIPTEUR_ID contains the label-id and LIBELLE contains the label name.
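
A label-id to label-name lookup can be built from this XML with Python's standard library. The sketch below parses the snippet shown above; the function name load_label_names is hypothetical, and it assumes every RECORD contains both a DESCRIPTEUR_ID and a LIBELLE element.

```python
import xml.etree.ElementTree as ET

# The snippet from above, as bytes so the encoding declaration is honoured.
SNIPPET = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE DESCRIPTEUR SYSTEM "descripteur.dtd">
<DESCRIPTEUR LNG="EN" VERSION="4_3">
  <RECORD>
    <DESCRIPTEUR_ID>4444</DESCRIPTEUR_ID>
    <LIBELLE>abandoned land</LIBELLE>
  </RECORD>
</DESCRIPTEUR>
"""


def load_label_names(xml_bytes):
    """Map each DESCRIPTEUR_ID (label-id) to its LIBELLE (label name)."""
    root = ET.fromstring(xml_bytes)
    return {
        rec.findtext("DESCRIPTEUR_ID").strip(): rec.findtext("LIBELLE").strip()
        for rec in root.iter("RECORD")
    }


label_names = load_label_names(SNIPPET)
print(label_names)
```

The label-ids are kept as strings so that they can be matched against the ids read from the cf file without conversion.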